1. INTRODUCTION

Natural languages contain words that have different meanings in different contexts [1].
Humans can easily distinguish between these meanings because we can look at the context of the sentence and
determine the meaning of an ambiguous word. Making a computer understand the meaning of an ambiguous word,
however, is a difficult task. Word Sense Disambiguation (WSD) addresses this task by determining
the meaning of ambiguous words [2]. For example, consider the ambiguous word 'bank':

1. “He sat down beside the Seine river bank” [3]. 

2. “He deposited the money at the Chase bank” [3]. 

The word 'bank' has a different meaning in each sentence. In the first sentence, it means a place near the
river, while in the second sentence, it means a financial institution.

Word sense disambiguation is an important problem because it has many applications, such as machine
translation [4] and sentiment analysis [5]. In machine translation, a sentence containing an
ambiguous word cannot be translated directly without looking at the context; otherwise, the translation can be wrong.
The accuracy of machine translation in translating words can be improved [6], and one way to do so is by using
word sense disambiguation.

Many researchers have proposed various approaches to solve word sense disambiguation problems,
but none of them can handle words that do not exist in the corpus. The work in [7] proposed the use of an adapted
weighted graph to solve the problem, and the work in [8] proposed the use of machine learning. Another way to solve
word sense disambiguation is by using a corpus. A corpus is a set of structured text that has many uses. For example,
it can be used to classify emotions from music [9], emotions from a text [10], and word sense
disambiguation. The work in [3] proposed a word sense disambiguation solution using Skip-Gram corpora, with the
Google Word Sense Disambiguation Corpora as the corpora, and achieved an accuracy of 42.12%.
However, the existing corpus-based method does not handle the case in which no word from the sentence is
in the corpus.

Journal homepage: http://iaescore.com/journals/index.php/ijeecs
ISSN: 2502-4752

In this research, the use of Wikipedia and Word2vec is proposed to develop the corpora.
In addition, the Lesk algorithm and Wu Palmer similarity are used to handle the case in which no word from the
sentence is in the corpora. First, two corpora are developed using data from Wikipedia. The data obtained
from Wikipedia are preprocessed to minimize word variations [11]. After preprocessing,
the corpora are created using Word2vec. Second, the corpora are used to determine the meaning of an ambiguous
word. To do this, the similarity of a sentence to the first and the second corpus is calculated using
cosine similarity. If none of the words from the sentence belongs to the corpora, the Lesk algorithm [12]
and Wu Palmer similarity [13] are used to calculate the similarity instead. Then, the meaning of the ambiguous word is
determined based on the similarity value that has been calculated.


2. RESEARCH METHOD 

The main objective of this research is to develop the corpora and to use them as a tool to solve the word
sense disambiguation problem. The proposed method is divided into three parts: the first part is developing
the Wikipedia corpora, the second part is determining the result, and the third part is the performance measure.


2.1. Developing Wikipedia Corpora 

Figure 1 shows the process of developing the corpora. Two datasets from Wikipedia, such as
“bank (financial)” and “bank (geography)”, are used as input to be preprocessed. Then,
the preprocessed data are used to create two corpora using word2vec. Corpus 1 and corpus 2 are the outputs of
the respective datasets.





[Figure 1 flowchart: Dataset 1 and Dataset 2; Preprocessing Data (1. Lowercase, 2. Remove punctuation, 3. Tokenize, 4. POS tagging, 5. Lemmatize, 6. Remove stop words); Create word2vec corpora (1. Set dimension = 100, 2. Set window = 5, 3. Set minimum words appeared = 10); Corpus 1 and Corpus 2]

Figure 1. Corpora development








2.1.1 Dataset 

A Wikipedia article has many features, including a table of contents, article references, and
category labels. In this paper, the category labels of Wikipedia articles are used to
determine which articles are selected as the dataset for developing a corpus. For example, to develop corpora
for the word “bank”, the first corpus represents bank as a financial institution and the second corpus represents bank as
geography. The article selection is based on the category labels from Wikipedia. For bank as a financial
institution, the Wikipedia articles that contain the word “bank” and have category labels related to financial
institutions are selected. The categories we found to be related to financial institutions are Banks,
Banking, Legal Entities, Italian Inventions, and Economic History of Italy. For bank as geography,
the category labels are Hydrology, Geomorphology, Limnology, Freshwater Ecology, Fluvial Landforms,
Riparian Zone, Rivers, Water Streams, and Water and the Environment. After the articles are chosen, the
paragraph content of each article is obtained. Paragraphs taken from Wikipedia are then broken down
into sentences based on period punctuation. The process can be seen in Figure 2.





[Figure 2 flowchart: Search for articles that contain the ambiguous word; Look at the category labels; Determine the article]

Figure 2. Article selection
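
The selection rule described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names `matches_sense` and `split_sentences` are introduced here for illustration, and the article text and category labels are passed in as plain values instead of being fetched from Wikipedia.

```python
import re

# Category labels listed in the text for each sense of "bank" (lowercased).
FINANCIAL_CATEGORIES = {"banks", "banking", "legal entities",
                        "italian inventions", "economic history of italy"}
GEOGRAPHY_CATEGORIES = {"hydrology", "geomorphology", "limnology",
                        "freshwater ecology", "fluvial landforms",
                        "riparian zone", "rivers", "water streams",
                        "water and the environment"}


def matches_sense(text, categories, word, sense_categories):
    """Return True if the article mentions `word` and carries at least
    one category label that belongs to the given sense."""
    has_word = re.search(r"\b" + re.escape(word) + r"\b", text, re.IGNORECASE)
    labels = {c.strip().lower() for c in categories}
    return bool(has_word) and bool(labels & sense_categories)


def split_sentences(paragraph):
    """Break a paragraph into sentences at period punctuation."""
    return [s.strip() for s in paragraph.split(".") if s.strip()]
```

Splitting on every period is the simple rule stated in the text; it will also split at abbreviations and decimal numbers, so a real dataset may need extra cleaning.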


2.1.2 Preprocessing 
The sentences from the dataset contain many word variations, which makes the process of
creating the corpora less accurate. Preprocessing therefore consists of six steps.

a) Lowercase
This is a simple way to reduce word variations. For example, the two words
“Money” and “money” will be recognized as the same word.

b) Remove punctuation
To build the corpora and test them with our testing data, we need only the words. Therefore,
punctuation is removed.

c) Tokenize
We tokenize the input sentence to make it easier to process in the next steps, which are POS tagging
and lemmatizing.

d) POS tagging
To make the data more accurate, we use POS tagging. Part of Speech (POS) tagging is commonly used
to determine whether a word is a noun, a verb, an adjective, or an adverb.

e) Lemmatize
This is an important step for reducing variations in the data. Words like
“banks” are mapped to the word “bank”. We lemmatize the words based on their POS tags,
so the lemmatized words are more accurate.

f) Remove stop words
Stop words are words that do not carry significant meaning when used to create a corpus. For
example, the words "the" and "to be" do not contribute significant meaning to the context of a
sentence. Table 1 shows an example of the preprocessing result.


Table 1. Preprocessing Result

Input:  My current bank deposit account interest rate has just been cut again.
Output: current bank deposit account interest rate cut

Input:  Most people have a current account and most banks pay virtually no interest on this money.
Output: people current account bank pays virtually interest money
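
The order of the six preprocessing steps can be sketched in plain Python. This is only an illustration: the stop-word set, the `TOY_POS` table, and the `TOY_LEMMAS` table are tiny hypothetical stand-ins for a real POS tagger, lemmatizer, and stop-word list, and the function names are introduced here.

```python
import string

# Hypothetical, tiny stand-ins for real language resources.
STOP_WORDS = {"my", "has", "just", "been", "again", "the", "a", "to", "be"}
TOY_POS = {"cut": "v", "banks": "n"}      # word -> coarse POS tag
TOY_LEMMAS = {("banks", "n"): "bank"}     # (word, POS tag) -> lemma


def pos_tag(tokens):
    """Toy tagger: look each word up, defaulting to noun."""
    return [(t, TOY_POS.get(t, "n")) for t in tokens]


def lemmatize(tagged):
    """Toy lemmatizer keyed on (word, POS tag); unknown words are kept."""
    return [TOY_LEMMAS.get((word, tag), word) for word, tag in tagged]


def preprocess(sentence):
    """Apply the six steps in the order given in the text."""
    text = sentence.lower()                                            # 1. lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # 2. remove punctuation
    tokens = text.split()                                              # 3. tokenize
    tagged = pos_tag(tokens)                                           # 4. POS tagging
    lemmas = lemmatize(tagged)                                         # 5. lemmatize
    return [w for w in lemmas if w not in STOP_WORDS]                  # 6. remove stop words
```

With these toy tables, the first input sentence of Table 1 reduces to the output shown there; a full implementation would replace the toy tagger and lemmatizer with library components.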





2.1.3 Create Word2vec Corpora 

The corpora are developed using the word2vec word embedding technique [14]-[16], which was developed at Google, with
data obtained from the content of Wikipedia articles. Word2vec is a two-layer neural
network that produces word embeddings in a vector space. In this vector space, words that share common
contexts are located close to one another [14].

There are two models for creating a corpus: the Continuous Bag of Words (CBOW) model and
the Skip-gram model [14]. The CBOW model predicts a word from its given context, whereas the Skip-gram
model predicts the words that surround a given word [14]. The preprocessed datasets are used as input to
Word2vec. Since this research does not have large datasets for training, the Skip-gram model is used because it
handles infrequent words better than the CBOW model. The Skip-gram model is used with
100-dimensional vectors, a window of five words, and a minimum word count of ten.
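
The settings above might be passed to gensim as in the following sketch. It assumes the gensim 4.x argument names (`vector_size`; gensim 3.x calls it `size`), and the names `WORD2VEC_PARAMS` and `train_corpus` are introduced here for illustration, so this is not necessarily how the original experiments were configured.

```python
# Hyperparameters stated in the text, using gensim 4.x argument names
# (an assumption about the library version).
WORD2VEC_PARAMS = {
    "vector_size": 100,  # dimension of each word vector
    "window": 5,         # context window of five words
    "min_count": 10,     # ignore words that appear fewer than ten times
    "sg": 1,             # 1 selects Skip-gram, 0 selects CBOW
}


def train_corpus(sentences):
    """Train one sense corpus from preprocessed, tokenized sentences
    (a list of token lists)."""
    from gensim.models import Word2Vec  # imported lazily; requires gensim
    return Word2Vec(sentences=sentences, **WORD2VEC_PARAMS)
```

`train_corpus` would be called once per dataset, for example once with the financial sentences and once with the geography sentences, giving corpus 1 and corpus 2.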


2.2. Determine the Result 

Figure 3 shows the process of determining the result. Testing data from the Oxford English
Dictionary and Yourdictionary.com are preprocessed. Then, the similarity of each sentence to the corpora is
calculated to determine the result.


[Figure 3 flowchart: Testing Data; Preprocessing Data (1. Lowercase, 2. Remove punctuation, 3. Tokenize, 4. POS tagging, 5. Lemmatize, 6. Remove stop words); decision "Is there a word from the sentence inside the corpora?"; Cosine Similarity (1. Word cosine similarity, 2. Sentence similarity); Semantic Similarity (1. Lesk algorithm, 2. Wu Palmer similarity)]

Figure 3. Determine the Result
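
The branching in Figure 3 can be sketched as follows. This is a hedged illustration: `disambiguate` and `semantic_fallback` are hypothetical names, the two corpora are represented as plain dictionaries mapping a word to its cosine similarity with the ambiguous word, and excluding the ambiguous word itself from the "is there a word in the corpora" check is an assumption.

```python
def disambiguate(tokens, target, corpus_1, corpus_2, semantic_fallback):
    """Return 1 or 2 for the predicted sense of `target`.

    corpus_1 and corpus_2 are hypothetical dicts standing in for the
    word2vec corpora: word -> cosine similarity with `target`.
    """
    context = [t for t in tokens if t != target]
    known = any(t in corpus_1 or t in corpus_2 for t in context)
    if not known:
        # No context word is in either corpus: use semantic similarity
        # (Lesk algorithm followed by Wu Palmer similarity) instead.
        return semantic_fallback(tokens, target)
    # Words missing from a corpus, and the target itself, contribute 0.
    sim_1 = sum(corpus_1.get(t, 0.0) for t in context) / len(tokens)
    sim_2 = sum(corpus_2.get(t, 0.0) for t in context) / len(tokens)
    return 1 if sim_1 > sim_2 else 2  # ties fall to sense 2 (arbitrary choice)
```

Passing the fallback as a function keeps the cosine branch and the semantic branch independent, which mirrors the two boxes in the figure.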


2.2.1 Testing Data 
a) Oxford English Dictionary 

The Oxford English Dictionary (OED) is the largest English dictionary and is widely used to
look up word definitions and example sentences for a word. Therefore, the OED is used as
testing data.
b) Yourdictionary.com

Yourdictionary.com is a free online English dictionary that has many sample sentences, famous
quotes, and audio pronunciations. In this dictionary, example sentences are written by internet users, so the
data have many sentence variations. Therefore, it is used to test the proposed method.


2.2.2 Preprocessing Data 
The preprocessing step for the testing data is the same as the preprocessing step for developing the corpora.


2.2.3 Similarity to Corpora 
a) Cosine similarity 

Cosine similarity measures the cosine of the angle between two vectors [17].
Its value lies in the interval between -1 and 1. The formula for cosine
similarity is:

$$\text{cosine similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}} \tag{1}$$

where $A_i$ and $B_i$ are the components of the word2vec vectors $A$ and $B$, respectively, and $n$ is the vector dimension.
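
As a small illustration of (1), the following Python sketch computes the cosine similarity of two plain lists of numbers. The function name `cosine_similarity` and the handling of zero vectors are choices made here, not part of the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: a zero vector is treated as not similar
    return dot / (norm_a * norm_b)
```

Parallel vectors give a value of 1, orthogonal vectors give 0, and opposite vectors give -1, which matches the interval stated above.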
b) Sentence similarity 

For every word in a sentence except the ambiguous word itself, the cosine similarity
with the ambiguous word contained in the corpus is calculated [18]. The ambiguous word in the sentence and any word
from the sentence that is not in the corpus are given a value of 0. The values obtained for the words of the sentence
are then averaged:

$$\text{sentence similarity} = \frac{1}{n} \sum_{i=1}^{n} x_i \tag{2}$$

where,

$n$ = number of words in the sentence

$x_i$ = cosine similarity of the $i$-th word of the sentence with the ambiguous word
c) Determine the result 

The meaning of the ambiguous word in a sentence is determined by the value of the sentence similarity
that has been calculated. If the sentence similarity to corpus 1 is higher than that to corpus 2, then the
meaning of the ambiguous word in the sentence is the one defined by corpus 1, and vice versa. For
example, Table 2 shows the calculation for the preprocessed input sentence “current bank deposit account
interest rate cut” with corpus 1 for ‘bank’ as a financial institution and corpus 2 for ‘bank’ as geography.


Table 2. Cosine Similarity Result
Word from sentence     Cosine similarity with word ‘bank’
                       in corpus 1           in corpus 2
Current                0.838                 0 (not in corpus)
Bank                   0 (ambiguous word)    0 (ambiguous word)
Deposit                0.949                 0.983
Account                0.952                 0 (not in corpus)
Interest               0.925                 0 (not in corpus)
Rate                   0.895                 0.986
Cut                    0 (not in corpus)     0.992
Sentence similarity    0.651                 0.423
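
The sentence similarity values in Table 2 can be reproduced with a short Python sketch. The per-word values are copied from the table; the function name `sentence_similarity` and the use of `None` to mark zero-valued words are conventions introduced here.

```python
def sentence_similarity(scores):
    """Average the per-word similarities. None marks a word that is
    scored as 0 (the ambiguous word or a word missing from the corpus)."""
    if not scores:
        return 0.0
    return sum(0.0 if s is None else s for s in scores) / len(scores)


# Values from Table 2, in sentence order:
# current, bank, deposit, account, interest, rate, cut
corpus_1 = [0.838, None, 0.949, 0.952, 0.925, 0.895, None]
corpus_2 = [None, None, 0.983, None, None, 0.986, 0.992]

sim_1 = sentence_similarity(corpus_1)   # 4.559 / 7
sim_2 = sentence_similarity(corpus_2)   # 2.961 / 7
predicted = "financial institution" if sim_1 > sim_2 else "geography"
```

Because the zero-valued words still count toward n = 7, a sentence with many words missing from a corpus is pulled toward a low similarity for that corpus.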





2.2.4 Semantic Similarity 
a) Lesk algorithm 

The Lesk algorithm is a classical algorithm for word sense disambiguation. In this paper, the simplified
Lesk algorithm is used because it has better performance [12]. This algorithm is shown in Figure 4. It
counts the words that overlap between the input sentence and the definition and example sentences of
each sense of the word in a dictionary. In this case, WordNet is used as the dictionary.


[Figure 4 content; the overlapping words marked in the figure are "deposits" and "mortgage"]

Input: The bank can guarantee deposits will eventually cover future tuition cost
because it invest in adjustable-rate mortgage securities.

Definition: a financial institution that accepts deposits and channels the money into lending activities
Example: "he cashed a check at the bank"; "that bank holds the mortgage on my home"

Definition: sloping land (especially the slope beside a body of water)
Example: "they pulled the canoe up on the bank"; "he sat on the bank of the river and watched the currents"

Output: bank

Figure 4. Simplified Lesk Algorithm

Sentences that do not have a single word contained in the corpora are used as the input to this
algorithm. The output of this algorithm is one of the words in WordNet, which is used in the next step.
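
The overlap counting of the simplified Lesk algorithm can be sketched in Python using the two definitions quoted in Figure 4. This is an illustration only: the stop-word list is a small hypothetical one, the sense labels and function names are introduced here, and a real implementation would read definitions and examples from WordNet.

```python
import re

# Assumption: a small illustrative stop-word list.
STOP_WORDS = {"a", "the", "that", "and", "into", "he", "at", "on", "my", "of",
              "they", "up", "can", "will", "because", "it", "in"}

# Definitions and examples for two senses of "bank", as quoted in Figure 4.
SENSES = {
    "bank (financial institution)": (
        "a financial institution that accepts deposits and channels the "
        "money into lending activities. he cashed a check at the bank. "
        "that bank holds the mortgage on my home"),
    "bank (geography)": (
        "sloping land (especially the slope beside a body of water). "
        "they pulled the canoe up on the bank. he sat on the bank of the "
        "river and watched the currents"),
}


def content_words(text, target):
    """Lowercased word set without stop words and without the target word."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return words - STOP_WORDS - {target}


def simplified_lesk(sentence, target, senses):
    """Return the sense whose definition and examples share the most
    words with the sentence, together with the overlapping words."""
    context = content_words(sentence, target)
    best_sense, best_overlap = None, set()
    for sense, signature in senses.items():
        overlap = context & content_words(signature, target)
        if best_sense is None or len(overlap) > len(best_overlap):
            best_sense, best_overlap = sense, overlap
    return best_sense, best_overlap
```

For the input sentence in Figure 4, the financial sense wins because its definition and examples share the words "deposits" and "mortgage" with the sentence, while the geography sense shares none.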
b) Wu Palmer Similarity 

Wu Palmer similarity [13] is one of many algorithms that measure the semantic similarity of two
words based on the WordNet tree.

The formula for calculating similarity using Wu Palmer is

$$\text{wu palmer similarity} = \frac{2 \times \text{Depth}(LCS)}{\text{Depth}(a) + \text{Depth}(b)} \tag{3}$$

where,

$LCS$ = Least Common Subsumer (the deepest common parent of the two words searched)

$a$ = the first word

$b$ = the second word


The word returned by the Lesk algorithm is then compared, using Wu Palmer similarity, with each real meaning
of the ambiguous word in WordNet, and the resulting score is used to determine the result. For example,
if the output of the Lesk algorithm is “slope”, the word “slope” is compared with the word “bank” as a
financial institution and with the word “bank” as geography in WordNet.
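
To make (3) concrete, the following Python sketch evaluates it on a small hand-made taxonomy. The taxonomy in `PARENT` is hypothetical and much shallower than WordNet, and counting the root at depth 1 is an assumption, so the resulting scores are illustrative only.

```python
# Hypothetical toy taxonomy (child -> parent); a real system would use
# the WordNet hierarchy instead.
PARENT = {
    "object": "entity",
    "land": "object",
    "slope": "land",
    "bank_geography": "land",
    "organization": "entity",
    "institution": "organization",
    "bank_financial": "institution",
}


def path_to_root(node):
    """Nodes from `node` up to the root, inclusive."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(node):
    """Depth counted in nodes, so the root has depth 1 (an assumption)."""
    return len(path_to_root(node))


def wu_palmer(a, b):
    """2 * Depth(LCS) / (Depth(a) + Depth(b)) on the toy taxonomy."""
    ancestors_b = set(path_to_root(b))
    # The first shared node on the way up from `a` is the deepest common parent.
    lcs = next(n for n in path_to_root(a) if n in ancestors_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))
```

In this toy tree, "slope" scores 0.75 against the geography sense of "bank" and 0.25 against the financial sense, so the geography sense would be chosen.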

c) Determine the result 

The meaning of the ambiguous word in a sentence is determined by the value of the Wu Palmer
similarity that has been calculated. If the Wu Palmer similarity to the sense of corpus 1 is higher than that to
the sense of corpus 2, then the meaning of the ambiguous word in the sentence is the one defined by corpus 1, and vice
versa.


2.3. Performance Measure 
To evaluate the proposed method, the following formulas are used:

$$\text{Precision} = \frac{1}{2} \left( \frac{TS_1}{TS_1 + FS_1} + \frac{TS_2}{TS_2 + FS_2} \right) \tag{4}$$

$$\text{Recall} = \frac{1}{2} \left( \frac{TS_1}{TS_1 + FS_2} + \frac{TS_2}{TS_2 + FS_1} \right) \tag{5}$$

$$\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{6}$$

$$\text{Accuracy} = \frac{TS_1 + TS_2}{\text{Total Data}} \tag{7}$$

where,

$TS_1$ = True prediction of the first sense
$FS_1$ = False prediction of the first sense
$TS_2$ = True prediction of the second sense
$FS_2$ = False prediction of the second sense
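
The four measures can be computed with a short Python sketch. The function name `evaluate` is introduced here, and defaulting Total Data to the sum of the four counts is an assumption that holds only when every test sentence receives a prediction.

```python
def evaluate(ts1, fs1, ts2, fs2, total_data=None):
    """Averaged precision and recall, F1 score, and accuracy for a
    two-sense problem. A false prediction of one sense is a missed
    instance of the other sense, so recall uses the crossed terms."""
    if total_data is None:
        # Assumption: every test sentence received a prediction.
        total_data = ts1 + fs1 + ts2 + fs2
    precision = (ts1 / (ts1 + fs1) + ts2 / (ts2 + fs2)) / 2
    recall = (ts1 / (ts1 + fs2) + ts2 / (ts2 + fs1)) / 2
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (ts1 + ts2) / total_data
    return precision, recall, f1, accuracy
```

The sketch does not guard against a sense that is never predicted, in which case a denominator would be zero.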


3. RESULTS AND ANALYSIS 

The proposed method is implemented in the Python programming language. To get the
articles from Wikipedia, we use the content function of the Wikipedia Python library. The nltk Python library is
used to preprocess the data from Wikipedia, and the gensim Python library is used to create the word2vec
corpora. The amount of testing data we used can be seen in Table 3. Table 4 shows the experiment result
without semantic similarity. For the unknown sentences, in which no word from the sentence is inside either corpus, the
sentence similarity has a value of 0. Therefore, the precision, recall, and F1 score cannot be calculated. We can
only calculate the accuracy.

Table 3. Testing Data
Ambiguous word   Senses                              Wikipedia dataset (sentences)   Testing data (sentences)
Bank             Financial Institution & Geography   335                             138
Plant            Factory & Biology                   298                             80
Heart            Feeling & Organ                     369                             40
Average          -                                   334                             86





Table 4. Experiment Results of Cosine Similarity Without Semantic Similarity
Ambiguous word   Unknown sentences   Accuracy (%)
Bank             29                  73.72
Plant            10                  81.25
Heart            2                   77.50
Average          13.6                77.49





Table 5 presents the experiment result with semantic similarity. Since the semantic similarity
never has a value of 0, we can calculate the precision, recall, and F1 score. As can be seen in
Table 5, using semantic similarity when no word from the sentence is inside either corpus
improves the average accuracy by 8.02 percentage points, from 77.49% to 85.51%.


Table 5. Experiment Result of Cosine Similarity with Semantic Similarity
Ambiguous word   Precision (%)   Recall (%)   F1 Score (%)   Accuracy (%)
Bank             88.21           89.33        88.76          89.05
Plant            85.00           85.00        85.00          85.00
Heart            82.50           82.58        82.54          82.50
Average          85.23           85.63        85.43          85.51





4. CONCLUSION 

This research proposes the use of Wikipedia and Word2vec to develop the corpora. Additional
algorithms, namely the Lesk algorithm and Wu Palmer similarity, are used to handle words that do not exist in a corpus.
The proposed method for solving word sense disambiguation problems achieves an accuracy of
85.51%, and the semantic similarity improves the accuracy by 8.02 percentage points. For further research, the handling
of words from a sentence that are not in the corpora under cosine similarity is still lacking and
can be developed further to achieve better accuracy.